Creating Customer Segments

Project Summary

In this project we apply unsupervised learning techniques to create customer segments. This is done by analysing a dataset containing various customers' annual spending amounts (reported in monetary units) across diverse product categories. A great blog post on customer segmentation is found here.


Aim of Project

A goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Achieving this has several benefits for the distributor: chiefly, it would equip the distributor with insight into how best to structure their delivery service to meet the needs of each customer. This is a critical and practical problem that many distributors and organisations alike face.

The other goal of the project is simply to group customers into clusters with similar spending characteristics. This equips an organisation to better understand its audience and to plan production so that demand is met in a timely manner.


Data Description

The dataset for this project can be found on the UCI Machine Learning Repository. The site currently maintains 622 data sets as a service to the machine learning community.

The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. Below is a list and description of the variables in the data set:

1) FRESH: annual spending (m.u.) on fresh products (Continuous);
2) MILK: annual spending (m.u.) on milk products (Continuous);
3) GROCERY: annual spending (m.u.) on grocery products (Continuous);
4) FROZEN: annual spending (m.u.) on frozen products (Continuous);
5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
7) CHANNEL: customers' channel - Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal);
8) REGION: customers' region - Lisbon, Oporto or Other (Nominal).

More information can be found here.

For the purposes of this project, the features 'Channel' and 'Region' will be excluded from the analysis; the focus is instead on the six product categories recorded for customers.


Setup

We begin by loading the necessary packages for the project, importing the .py file we created for supplementary visualisations, and loading the data.

In [138]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

# Show matplotlib plots inline
%matplotlib inline

# Import supplementary code visuals.py
import visuals as vs
In [7]:
# Load the dataset
data = pd.read_csv("customers.csv")
data.drop(['Region', 'Channel'], axis = 1, inplace = True)
print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

Here we shall begin having a look at our data through a series of different plots and code to understand how each feature is related to the others.

As mentioned the data set has six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'. Let's produce a quick summary of the variables in our data set.

In [8]:
# Summary
display(data.describe().transpose())
count mean std min 25% 50% 75% max
Fresh 440.0 12000.297727 12647.328865 3.0 3127.75 8504.0 16933.75 112151.0
Milk 440.0 5796.265909 7380.377175 55.0 1533.00 3627.0 7190.25 73498.0
Grocery 440.0 7951.277273 9503.162829 3.0 2153.00 4755.5 10655.75 92780.0
Frozen 440.0 3071.931818 4854.673333 25.0 742.25 1526.0 3554.25 60869.0
Detergents_Paper 440.0 2881.493182 4767.854448 3.0 256.75 816.5 3922.00 40827.0
Delicatessen 440.0 1524.870455 2820.105937 3.0 408.25 965.5 1820.25 47943.0

We see that the ranges are quite vast for all the variables. Let's better visualise these results using box plots.

In [14]:
# Box plots of each variable
sns.set(rc = {'figure.figsize':(18,12)})        # controls the seaborn figure size

plt.subplot(3, 2, 1)
sns.boxplot(y=data["Fresh"], width=0.3);

plt.subplot(3, 2, 2)
sns.boxplot(y=data["Milk"], width=0.3);

plt.subplot(3, 2, 3)
sns.boxplot(y=data["Grocery"], width=0.3);

plt.subplot(3, 2, 4)
sns.boxplot(y=data["Frozen"], width=0.3);

plt.subplot(3, 2, 5)
sns.boxplot(y=data["Detergents_Paper"], width=0.3);

plt.subplot(3, 2, 6)
sns.boxplot(y=data["Delicatessen"], width=0.3);


plt.tight_layout()
plt.show()

We see that all the variables are highly positively (right) skewed, with large outlying values.
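As a quick numerical check of this skewness claim, pandas' `Series.skew()` returns a positive value for right-skewed data. A toy example (made-up numbers, not our dataset):

```python
import pandas as pd

# A small series with a long right tail (toy data)
s = pd.Series([1, 2, 2, 3, 100])
print(s.skew() > 0)  # positive skew => right-skewed
```

Running `data.skew()` on the full DataFrame would give one such value per product category.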

We can select a subsample to aid us in better understanding (some of) the customers and how their data will transform through the analysis. We select a few sample data points and explore them in more detail below. We experimented with various indices (which represent the customers to track) in order to obtain customers that vary significantly from one another.

In [15]:
# Customers to track
indices = [26,176,392]
In [18]:
# Create DataFrame of Subsample
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset: \n")
display(samples)
Chosen samples of wholesale customers dataset: 

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 9898 961 2861 3151 242 833
1 45640 6958 6536 7368 1532 230
2 518 4180 3600 659 122 654

A possible question we could be posed with: judging only by the expenditure of these three customers, what kind of establishment could each of the three samples we've chosen represent?

Customer 1

  • might be from a restaurant. We see a high amount of Frozen, and close-to-median amounts of Fresh and Deli.
In [25]:
# Customer 1
print("Customer 1:")
display(samples.loc[0])
Customer 1:
Fresh               9898
Milk                 961
Grocery             2861
Frozen              3151
Detergents_Paper     242
Delicatessen         833
Name: 0, dtype: int64

Customer 2

  • might be from a supermarket. We see really high or close-to-median levels of purchases across all categories of products excluding deli. This could simply be because the supermarket does not have a deli, which would explain all the other purchases.
In [29]:
# Customer 2
print("Customer 2:")
display(samples.loc[1])
Customer 2:
Fresh               45640
Milk                 6958
Grocery              6536
Frozen               7368
Detergents_Paper     1532
Delicatessen          230
Name: 1, dtype: int64

Customer 3

  • might represent a café. We see a high purchase of milk and somewhat close-to-median levels for Groceries and Deli. We also see relatively lower purchases of fresh produce and frozen goods - a mark of most cafés, where hot drinks and quick, easy meals are mostly made.
In [28]:
# Customer 3
print("Customer 3:")
display(samples.loc[2])
Customer 3:
Fresh                518
Milk                4180
Grocery             3600
Frozen               659
Detergents_Paper     122
Delicatessen         654
Name: 2, dtype: int64

To compare the three customers, below we plot the samples alongside the dataset means.

In [39]:
# Comparison of Samples

# Get the means
mean_data = data.describe().loc['mean', :]

# Append means to the samples' data (DataFrame.append is deprecated, so use pd.concat)
samples_bar = pd.concat([samples, mean_data.to_frame().T])

# Construct indices
samples_bar.index = indices + ['mean']

# Plot bar plot
samples_bar.plot(kind='bar', figsize=(14,8));
Out[39]:
<AxesSubplot:>

For Customer 2 (index 176), we see a very large spend on Fresh items compared to the other customers, and even in comparison to the mean of the dataset (of all customers).

Some extremes for other variables are also reflected/highlighted below in the comparison of percentiles plot.

In [40]:
# Percentiles comparison

# First, calculate the percentile ranks of the whole dataset.
percentiles = data.rank(pct=True)

# Then, round the ranks and express them as percentages
percentiles = 100*percentiles.round(decimals=3)

# Select the indices you chose from the percentiles dataframe
percentiles = percentiles.iloc[indices]

# Now, create the heat map using the seaborn library
sns.heatmap(percentiles, vmin=1, vmax=99, annot=True);
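As a quick illustration of what `rank(pct=True)` (used above) computes, here is a toy example on made-up numbers, not our data:

```python
import pandas as pd

# Four distinct spends: percentile rank is the fraction of values at or below each one
df = pd.DataFrame({'Milk': [10, 20, 30, 40]})
pct_ranks = df.rank(pct=True)
print(pct_ranks['Milk'].tolist())  # -> [0.25, 0.5, 0.75, 1.0]
```

Multiplying by 100, as in the cell above, turns these fractions into percentile scores for the heat map.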

Now let's consider other customers, using indices 43, 12 and 39. The resulting sample is shown below.

In [41]:
indices2 = [43, 12, 39]
# Create DataFrame of Subsample
samples2 = pd.DataFrame(data.loc[indices2], columns = data.keys()).reset_index(drop = True)
print("Chosen second samples of wholesale customers dataset: \n")
display(samples2)
Chosen second samples of wholesale customers dataset: 

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 630 11095 23998 787 9529 72
1 31714 12319 11757 287 3881 2931
2 56159 555 902 10002 212 2916

Looking at the quantities of the annual expenditures, we can perhaps deduce, or at least speculate on, the type of establishment. We tried to be more specific here, for fun.

  • Index 0: Coffee Cafe

      - Low spending on "Fresh", "Frozen" and "Delicatessen".
      - Majority of spending on "Grocery", "Milk" and "Detergents_Paper".
      - With some spending on "Delicatessen", it may be a cafe establishment serving drinks, coffee perhaps, with some ready-made food as a complementary product.
    

  • Index 1: Upscale Restaurant

      - Low spending on "Frozen".
      - Majority of spending is a mix of "Fresh", "Milk", and "Grocery".
      - This may be an upscale restaurant with almost no spending on frozen foods.
      - Most upscale restaurants use only fresh foods.
    

  • Index 2: Fresh Food Retailer

      - Majority of spending is on "Fresh" goods with little spending on everything else except on "Frozen".
      - This may be a grocery store specializing in fresh foods with some frozen goods.

An interesting question arises when we look at the data. Is it possible that maybe one or more of the six product categories is actually not relevant for understanding customer purchasing? That is, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products?

This is a clear deviation from our data exploration, but it is an interesting question which we can answer by running a regression. That is, we shall train a simple supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

We do this with the feature MILK removed (we need to predict it given consumption of the other variables).

In [30]:
# Import Modules
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# New Data without Milk (need to predict it)
new_data = data.drop(['Milk'],axis=1)

# Create Train and test data sets
X_train, X_test, y_train, y_test = train_test_split(new_data,data['Milk'],test_size=0.25,random_state=101)

# fit regression decision tree (CART)
regressor = DecisionTreeRegressor(random_state=101).fit(X_train,y_train)

# Get Prediction
score = regressor.score(X_test,y_test)

print(score)
0.29571438444092835

We trained the model with 75% of the data from our original data set and tested it using the remaining 25% of customer records. So we tried to predict the 'Milk' feature (i.e. annual spending on milk products) based on the other features in the dataset, and our model achieved a predicted $R^2$ score of 0.2957. Since $R^2$ is at most 1 (and can even be negative for a model that predicts worse than simply using the mean), the model we built for customers' milk purchasing habits isn't very good, although it is possible that there's some correlation between this feature and others.

It's safe to say that the 'Milk' feature is necessary for identifying customers' spending habits, because it isn't possible to predict how a customer spends on Milk based on their spending on the other product categories. We can say that the 'Milk' feature adds extra (and maybe key) information to the data which a model cannot easily infer from the other features alone.
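As a side note, held-out $R^2$ (as returned by scikit-learn's `score`/`r2_score`) can even be negative when a model predicts worse than the mean of the target. A small sketch with made-up numbers, not our dataset:

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0]
good = r2_score(y_true, [3.1, 5.0, 6.9])   # near-perfect predictions
bad = r2_score(y_true, [7.0, 3.0, 5.0])    # worse than predicting the mean
print(good, bad)  # good is close to 1, bad is negative
```

This is worth keeping in mind when interpreting the per-feature scores computed below.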

Suppose we wanted to find the $R^2$ score for each variable, not just Milk. We can create a loop that treats each variable in turn as the target. Let's do that below.

In [46]:
# Create list to loop through
dep_vars = list(data.columns)

# Create loop to test each feature as a dependent variable
for var in dep_vars:

    #'drop' function to drop the given feature
    new_data = data.drop([var], axis = 1)

    # Create feature Series
    new_feature = pd.DataFrame(data.loc[:, var])

    # Split the data into training and testing sets using the given feature as the target
    X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=101)

    # Create a decision tree regressor and fit it to the training set
    dtr = DecisionTreeRegressor(random_state=101)
    dtr.fit(X_train, y_train)

    # Report the score of the prediction using the testing set (Returns R^2)
    score = dtr.score(X_test, y_test)
    print('R2 score for {} as dependent variable: {}'.format(var, score))
R2 score for Fresh as dependent variable: -1.225551746474062
R2 score for Milk as dependent variable: 0.29571438444092835
R2 score for Grocery as dependent variable: 0.6360717396234632
R2 score for Frozen as dependent variable: -0.35798453505334815
R2 score for Detergents_Paper as dependent variable: 0.6895522615942111
R2 score for Delicatessen as dependent variable: 0.13556294884739872

So essentially what we did was loop over the features, treating every single one as the dependent variable, with the results shown above.

We see that "Fresh" and "Frozen" as dependent variables have negative $R^2$ scores. Their negative scores imply that they are necessary for identifying customers' spending habits because the remaining features cannot explain the variation in them.

Similarly, "Milk" and "Delicatessen" have very low R2 scores. Their low scores also imply that they are necessary for identifying customers' spending habits.

We now turn our attention to the whole data set again. We can construct a scatter matrix of each of the six product features present in the data.

Apart from simply finding associations between variables in our data set, this scatter plot matrix can also help with an earlier question we asked. That is, we can investigate the manner in which MILK may or may not be related to the other variables.

In [43]:
# Produce a scatter matrix
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (20,10), diagonal = 'kde');

Looking at the plot above, we get the relationship between each pair of variables. We also get, on the diagonal, the density distribution of each variable.

We can identify that there are a few pairs of features that exhibit some degree of correlation. They include:

  • Milk and Groceries
  • Milk and Detergents_Paper
  • Grocery and Detergents_Paper

  • As we tried to predict the ‘Milk’ feature earlier, we found that it was in fact not really possible to predict how a customer spends on Milk based on their spending on the other product categories. The correlation results above confirm the suspicion that Milk isn’t correlated with most of the features in the dataset, although it shows a relatively mild correlation with ‘Groceries’ and ‘Detergents_Paper’.

Moreover, the features that are strongly correlated (i.e., Grocery and Detergents_Paper) do lend credence to our initial claim that Grocery may not be necessary for identifying customers' spending habits.

  • Grocery has a high correlation with Detergents_Paper and Milk, which corresponds to a relatively high $R^2$ score when we regress Grocery on all other features. We saw $R^2$ scores of 0.636 and 0.689 for Grocery and Detergents_Paper respectively.

The distributions of all the features appear similar: strongly right skewed, as we suspected from the box plots earlier on. That is, most of the data points fall in the first few intervals. Judging by the summary statistics of the features that we calculated earlier, especially the mean and maximum values, we can expect that there are some outliers in each of the distributions. We also saw several outliers when we drew up the box plots.

The data are not normally distributed, due in part to the presence of many outliers. This suggests a transformation is required to make the features closer to normally distributed, as many clustering algorithms (which we shall look at later) work best when the features are roughly normal.

Let's get the actual correlation heatmap for the data.

In [44]:
sns.set(rc = {'figure.figsize':(20,15)})
sns.heatmap(data.corr(), annot=True, cmap = "summer");            # heatmap with seaborn
Out[44]:
<AxesSubplot:>

We can cross-reference this with the scatter matrix above to draw more accurate insights from the data. The higher a cell's colour sits on the colour bar, the higher the correlation.


Data Pre-Processing

Here we shall seek to preprocess our data to create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers.

That is, we shall conduct:

  1. Feature Scaling
  2. Outlier Detection

This is all to ensure the results obtained from the analysis are significant and meaningful.

Feature Scaling

We noticed in our box plots, density plots and summary statistics that the distributions of the variables are not normal. Specifically, the mean and median differ significantly (indicating a large skew). This often indicates that non-linear scaling is required.

One way to achieve this scaling is the Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness.
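For instance, a Box-Cox sketch on synthetic right-skewed data (using `scipy.stats.boxcox`, not our dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=2.0, sigma=0.75, size=500)   # right-skewed synthetic sample

# boxcox returns the transformed data and the fitted power (lambda)
transformed, lmbda = stats.boxcox(skewed)
print(abs(stats.skew(transformed)) < abs(stats.skew(skewed)))  # skew is reduced
```

For log-normal-like data the fitted lambda is close to 0, which corresponds to a plain log transform.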

A simpler approach, which works in most cases, is applying the natural logarithm. We use this very method below.

In [47]:
# Scale the data using the natural logarithm
log_data = data.apply(lambda x: np.log(x))

# Scale the sample data using the natural logarithm
log_samples = samples.apply(lambda x: np.log(x))
In [48]:
# Produce a scatter matrix - transformed data
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

After applying a natural logarithm scaling to the data, the distribution of each feature appears much more normal. Moreover, we still see that the pairs of variables which were correlated before are still correlated after the transformation.

In [53]:
print("Correlation Matrix (Original Data):")
display(data.corr())
print("Correlation Matrix (Transformed Data):")
display(log_data.corr())
Correlation Matrix (Original Data):
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Fresh 1.000000 0.100510 -0.011854 0.345881 -0.101953 0.244690
Milk 0.100510 1.000000 0.728335 0.123994 0.661816 0.406368
Grocery -0.011854 0.728335 1.000000 -0.040193 0.924641 0.205497
Frozen 0.345881 0.123994 -0.040193 1.000000 -0.131525 0.390947
Detergents_Paper -0.101953 0.661816 0.924641 -0.131525 1.000000 0.069291
Delicatessen 0.244690 0.406368 0.205497 0.390947 0.069291 1.000000
Correlation Matrix (Transformed Data):
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Fresh 1.000000 -0.019834 -0.132713 0.383996 -0.155871 0.255186
Milk -0.019834 1.000000 0.758851 -0.055316 0.677942 0.337833
Grocery -0.132713 0.758851 1.000000 -0.164524 0.796398 0.235728
Frozen 0.383996 -0.055316 -0.164524 1.000000 -0.211576 0.254718
Detergents_Paper -0.155871 0.677942 0.796398 -0.211576 1.000000 0.166735
Delicatessen 0.255186 0.337833 0.235728 0.254718 0.166735 1.000000

Upon closer investigation, we do see that some of the correlations have shifted, though:

  • Grocery and Detergents_Paper have a weaker correlation.
  • Grocery and Milk have a slightly stronger correlation.
  • Detergents_Paper and Milk have a slightly stronger correlation.

Outliers

Identifying outliers in data is an important part of statistical analyses. The presence of outliers can often skew results which take into consideration these data points.

There are many methods and rules of thumb that describe what makes an outlier. We make use of Tukey's method for identifying outliers: a simple rule based on the quartiles of the data. Essentially, an outlier step is calculated as 1.5 times the interquartile range (IQR), and a data point with a feature value beyond an outlier step outside of the IQR for that feature is considered abnormal.

That is: the first quartile $Q1$ is the value ≥ 1/4 of the data, the second quartile $Q2$ (the median) is the value ≥ 1/2 of the data, and the third quartile $Q3$ is the value ≥ 3/4 of the data. The interquartile range is $IQR = Q3 − Q1$. Tukey's rule says that outliers are values more than 1.5 times the interquartile range from the quartiles: either below $Q1 − 1.5 \times IQR$, or above $Q3 + 1.5 \times IQR$ (Tukey's Method).

In [54]:
log_data.keys()
Out[54]:
Index(['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper',
       'Delicatessen'],
      dtype='object')
In [56]:
# Select the indices for data points that we  wish to remove
outliers  = []

# For each feature (keys) find the data points with extreme high or low values
for feature in log_data.keys():

    # Calculate Q1
    Q1 = np.percentile(log_data[feature],25)

    # Calculate Q3 for the given feature
    Q3 = np.percentile(log_data[feature],75)

    # calculate an outlier step (1.5 times the interquartile range)
    step = (Q3-Q1) * 1.5

    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))
    out = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    display(out)
    outliers = outliers + list(out.index.values)


# Keep only the indices that were flagged as outliers for more than one feature
outliers = list(set([x for x in outliers if outliers.count(x) > 1]))

print("Outliers: {}".format(outliers))

# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
print("The good dataset now has {} observations after removing outliers.".format(len(good_data)))

# Original Data
print('Original shape of data:\n', log_data.shape)
# Processed Data
print('New shape of data:\n', good_data.shape)
Data points considered outliers for the feature 'Fresh':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
81 5.389072 9.163249 9.575192 5.645447 8.964184 5.049856
95 1.098612 7.979339 8.740657 6.086775 5.407172 6.563856
96 3.135494 7.869402 9.001839 4.976734 8.262043 5.379897
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
171 5.298317 10.160530 9.894245 6.478510 9.079434 8.740337
193 5.192957 8.156223 9.917982 6.865891 8.633731 6.501290
218 2.890372 8.923191 9.629380 7.158514 8.475746 8.759669
304 5.081404 8.917311 10.117510 6.424869 9.374413 7.787382
305 5.493061 9.468001 9.088399 6.683361 8.271037 5.351858
338 1.098612 5.808142 8.856661 9.655090 2.708050 6.309918
353 4.762174 8.742574 9.961898 5.429346 9.069007 7.013016
355 5.247024 6.588926 7.606885 5.501258 5.214936 4.844187
357 3.610918 7.150701 10.011086 4.919981 8.816853 4.700480
412 4.574711 8.190077 9.425452 4.584967 7.996317 4.127134
Data points considered outliers for the feature 'Milk':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
86 10.039983 11.205013 10.377047 6.894670 9.906981 6.805723
98 6.220590 4.718499 6.656727 6.796824 4.025352 4.882802
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
356 10.029503 4.897840 5.384495 8.057377 2.197225 6.306275
Data points considered outliers for the feature 'Grocery':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
Data points considered outliers for the feature 'Frozen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
38 8.431853 9.663261 9.723703 3.496508 8.847360 6.070738
57 8.597297 9.203618 9.257892 3.637586 8.932213 7.156177
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
145 10.000569 9.034080 10.457143 3.737670 9.440738 8.396155
175 7.759187 8.967632 9.382106 3.951244 8.341887 7.436617
264 6.978214 9.177714 9.645041 4.110874 8.696176 7.142827
325 10.395650 9.728181 9.519735 11.016479 7.148346 8.632128
420 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
429 9.060331 7.467371 8.183118 3.850148 4.430817 7.824446
439 7.932721 7.437206 7.828038 4.174387 6.167516 3.951244
Data points considered outliers for the feature 'Detergents_Paper':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
161 9.428190 6.291569 5.645447 6.995766 1.098612 7.711101
Data points considered outliers for the feature 'Delicatessen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
109 7.248504 9.724899 10.274568 6.511745 6.728629 1.098612
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
137 8.034955 8.997147 9.021840 6.493754 6.580639 3.583519
142 10.519646 8.875147 9.018332 8.004700 2.995732 1.098612
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
183 10.514529 10.690808 9.911952 10.505999 5.476464 10.777768
184 5.789960 6.822197 8.457443 4.304065 5.811141 2.397895
187 7.798933 8.987447 9.192075 8.743372 8.148735 1.098612
203 6.368187 6.529419 7.703459 6.150603 6.860664 2.890372
233 6.871091 8.513988 8.106515 6.842683 6.013715 1.945910
285 10.602965 6.461468 8.188689 6.948897 6.077642 2.890372
289 10.663966 5.655992 6.154858 7.235619 3.465736 3.091042
343 7.431892 8.848509 10.177932 7.283448 9.646593 3.610918
Outliers: [128, 65, 66, 75, 154]
The good dataset now has 435 observations after removing outliers.
Original shape of data:
 (440, 6)
New shape of data:
 (435, 6)

Upon quick inspection, our sample doesn't contain any of the outlier indices. The good data is now a matrix measuring (435x6) instead of the original (440x6).

There were 5 data points considered outliers for more than one feature based on our definition above. So, instead of removing all outliers (which would result in us losing a lot of information), only data points that were outliers for more than one feature are removed. They were removed because they are extreme in multiple categories, and hence not representative of our general customers.
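An equivalent (and arguably clearer) way to find indices flagged for more than one feature is `collections.Counter`. A sketch with a hypothetical list of flagged indices (one entry per feature that flags an index):

```python
from collections import Counter

# Hypothetical per-feature outlier flags; index 38 was flagged only once
flagged = [38, 65, 66, 65, 75, 128, 154, 66, 75, 128, 154, 154]
repeated = sorted(i for i, n in Counter(flagged).items() if n > 1)
print(repeated)  # -> [65, 66, 75, 128, 154]
```

This avoids the quadratic `list.count` calls inside the comprehension used above.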

More information regarding whether to drop outliers is provided by the article here.


Principal Component Analysis (PCA)

Here we shall use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since PCA on a dataset calculates the dimensions which best maximise variance, we will find which combinations of features best describe customers.

More specfically; PCA refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features $X_1, X_2, . . . , X_p$ and no associated response $Y$. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the $p$ features.

TLDR: PCA starts with the data set being normalised, after which an eigen-analysis of the covariance matrix is done. Then, to reduce the dimension, the data set is projected onto the first few principal components. It is a method to reduce the dimensionality of data whilst retaining as much of the information in the data as possible.
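The steps in that TLDR can be sketched directly with NumPy on synthetic data (scikit-learn's PCA, used below, does essentially this internally, via an efficient SVD):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))              # synthetic data: 200 samples x 4 features

Xc = X - X.mean(axis=0)                    # 1) centre the data
cov = np.cov(Xc, rowvar=False)             # 2) covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # 3) eigen-analysis
order = np.argsort(eigvals)[::-1]          # sort components by explained variance
components = eigvecs[:, order]

projected = Xc @ components[:, :2]         # 4) project onto the first two PCs
print(projected.shape)                     # (200, 2)
```

The eigenvalues, normalised by their sum, are the explained variance ratios reported by `pca.explained_variance_ratio_`.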

Previously, using natural logarithms, we scaled the data to a more normal distribution, and we have removed the necessary outliers. As such, we can now apply PCA to good_data to discover which dimensions best maximise the variance of the features involved.

Note that a component (dimension) from PCA can be considered a new "feature" of the space; however, it is a composition of the original features present in the data.

In [57]:
# Import PCA module
from sklearn.decomposition import PCA

# Apply PCA by fitting the good data
pca = PCA().fit(good_data)

# Transform log_samples using the PCA fit
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)

If we look at the loadings for each component:

  • Dimension 1 has high positive weights for the Milk, Grocery, and Detergents_Paper features; these variables load highly on this component. This might represent Hotels, where these items are usually needed for guests. This is in line with our initial findings, where the three features are highly correlated. An increase in PC1 is associated with large increases in "Milk", "Grocery" and "Detergents_Paper" spending.
  • Dimension 2 has high positive weights for Fresh, Frozen, and Delicatessen. This dimension might represent 'restaurants', where these items are used as ingredients in cooked dishes. We also note a very low loading/contribution from Detergents_Paper. An increase in PC2 is associated with large increases in "Fresh", "Frozen" and "Delicatessen" spending.
  • Dimension 3 has high positive weights for the Deli and Frozen features, a low positive weight for Milk, and negative weights for everything else. This dimension might represent Delis. An increase in PC3 is associated with a large increase in "Delicatessen" and a large decrease in "Fresh" spending.

  • Dimension 4 has positive weights for Frozen, Detergents_Paper and Grocery, while being negative for Fresh and Deli. It's a bit tricky to pin this segment down, but there are shops that sell frozen goods almost exclusively. An increase in PC4 is associated with a large increase in "Frozen" and a large decrease in "Delicatessen" spending.

In addition to finding these dimensions, PCA also reports the explained variance ratio of each dimension — how much variance within the data is explained by that dimension alone.

We see that the first 4 components account for over 90% of the variation in the data. In fact, the first and second components together explain approx. 70.7% of the variance, and the first four components together explain approx. 93.11%. However, to establish how many components we need, we can use a scree plot.

In [58]:
PC_values = np.arange(pca.n_components_) + 1
plt.plot(PC_values, pca.explained_variance_ratio_, 'o-', linewidth=2, color='blue')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.show()

The x-axis displays the principal component and the y-axis displays the percentage of total variance explained by each individual principal component. We look for an elbow in the plot; a clear elbow appears at 3 components. We shall use 4, given that they explain over 90% of the variation in the data.

The exact percentage of total variance explained by each principal component is shown below.
In [59]:
# Explained Variance by PC's
print(pca.explained_variance_ratio_)
[0.44302505 0.26379218 0.1230638  0.10120908 0.04850196 0.02040793]

As mentioned before and shown in our earlier plot, the first principal component explains 44.3% of the total variation in the dataset. The second principal component explains 26.4% of the total variation. And so on.

Note, that one of the main goals of using PCA is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost:

  • fewer dimensions used implies less of the total variance in the data is being explained.

Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. We have established that using just four principal components, we can explain over 93% of the variation in the data.
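The cumulative ratio can be computed directly with NumPy's `cumsum`; a minimal sketch below uses the explained variance ratios printed in the earlier cell (so it stands alone, without the fitted `pca` object):

```python
import numpy as np

# explained variance ratios reported by the fitted PCA in the output above
ratios = np.array([0.44302505, 0.26379218, 0.1230638,
                   0.10120908, 0.04850196, 0.02040793])

cumulative = np.cumsum(ratios)
for k, c in enumerate(cumulative, start=1):
    print(f"first {k} component(s): {c:.2%} of total variance")
```

The fourth entry confirms the 93.1% figure used above to justify keeping four components.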

As such, since a significant amount of variance (around 70-83%) is explained by only two or three dimensions, the reduced data can be visualized afterwards in a bi-plot.

In [60]:
# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2).fit(good_data)

# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
In [62]:
# Create a biplot
vs.biplot(good_data, reduced_data, pca);

A bi-plot is a scatter plot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the bi-plot shows the projection of the original features along the components. A bi-plot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.

The original feature projections are in red. A point in the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery' and 'Detergents_Paper', but not so much on the other product categories.

We can also deduce a lot of other information from the biplot. Essentially, in the biplot the points are the projected observations and the vectors are the projected variables.

The variance of the data along the principal component directions is associated with the magnitude of the eigenvalues - that is, the length of the vector is a measure of the variance of the data when projected onto that axis. Moreover, the further away these vectors are from a PC origin, the more influence they have on that PC.

Also, the angles between the vectors tell us how characteristics correlate with one another. When two vectors are close, forming a small angle, the two variables they represent are positively correlated.
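These geometric facts can be checked numerically from `pca.components_`. The sketch below is illustrative only: it fits a fresh 2-component PCA on random stand-in data (not the actual spending data), then reads off each feature's arrow, its length, and the cosine of the angle between two arrows:

```python
import numpy as np
from sklearn.decomposition import PCA

# illustrative only: random stand-in for the log-scaled spending data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

pca2 = PCA(n_components=2).fit(X)

# each row of components_.T is one original feature's arrow on the PC plane
arrows = pca2.components_.T               # shape (6, 2)
lengths = np.linalg.norm(arrows, axis=1)  # arrow length = influence on the plane

# cosine of the angle between two feature arrows hints at how they co-vary
cos_angle = arrows[0] @ arrows[1] / (lengths[0] * lengths[1])
```

A small angle (cosine near 1) between two arrows suggests the corresponding variables move together in the projected plane.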


Clustering Analysis

Clustering analysis concerns itself with grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).

In this part of the project we shall examine the following clustering algorithms and choose one to identify the various customer segments hidden in the data:

  1. K-Means Clustering Algorithm
  2. Partitioning Around Medoids (PAM) Clustering
  3. Gaussian Mixture Model Clustering

All three methods are classified as Non-Hierarchical Clustering algorithms, as opposed to Hierarchical Clustering methods - a key difference is that, for non-hierarchical clustering, the desired number of clusters needs to be specified. Moreover, this type of clustering does not build a tree-like cluster structure; rather, it deals with splitting up and merging clusters.

K-Means

To begin the algorithm, each observation is randomly assigned to one of the specified clusters. Once assigned, the centroid of each cluster is calculated. The objective of the K-means algorithm is to minimise the within-cluster sum of squared Euclidean distances (the error sum of squares, ESS) for every cluster. This ensures that all observations are close to the centroid of the cluster they lie in. The algorithm iterates by assigning each observation to the closest cluster centroid, reducing the ESS, and continues until no observation moves from the cluster it is located in.

Advantages of K-Means clustering:

  • Simple, easy to implement and interpret results.
  • Good for hard cluster assignments i.e. when a data point only belongs to one cluster over the others.

Disadvantages

  • It may converge to a local optimum depending on the initialisation of the clusters.
  • It may be computationally expensive to compute Euclidean distances on large datasets.
  • It is susceptible to outliers (we can pre-process the data to exclude outliers to mitigate this).

Algorithm:

  1. Select the value of K, the number of clusters to be formed.
  2. Select K random points to act as the initial centroids.
  3. Assign each data point to its nearest centroid, forming the K clusters.
  4. Recompute the centroid of each cluster.
  5. Repeat step 3, reassigning each data point to the new closest centroid.
  6. If any reassignment occurred, go back to step 4; otherwise go to step 7.
  7. Complete
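The steps above can be sketched in plain NumPy. This is a toy illustration of the algorithm only, not the scikit-learn implementation used later:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Toy k-means following the steps above (illustrative, not sklearn's)."""
    rng = np.random.default_rng(seed)
    # step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # steps 5-7: stop once no reassignment moves the centroids
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

In practice we use scikit-learn's `KMeans`, which adds smarter initialisation and multiple restarts to avoid poor local optima.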

In order to assess the most suitable number of clusters, an Elbow Plot will be examined below. We can choose the right number of clusters with the help of the Within-Cluster Sum of Squares (WCSS): the sum of the squared distances of the data points in each cluster from its centroid.

The process:

  1. Execute K-means clustering on the dataset for a range of K values (here K = 1 to 6).

  2. For each value of K, calculate the WCSS value.

  3. Plot WCSS against the respective number of clusters K.

  4. The sharp bend of the plot (a point that looks like an elbow joint on an arm) marks the best/optimal value of K.

In [104]:
#import module
from sklearn.cluster import KMeans

# Data
x = reduced_data

# KMeans configuration
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 101,
}

# Initialise Within-Cluster-Sum-of-Squares (WSS)
wss=[]

# loop to get wss
for i in range(1, 7):
    kmeans = KMeans(n_clusters = i, **kmeans_kwargs)
    kmeans.fit(x)
    wss_iter = kmeans.inertia_
    wss.append(wss_iter)

# plot elbow plot with wss
number_clusters = range(1,7)
plt.plot(number_clusters, wss)
plt.title('Elbow Plot')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

We shall use 2 clusters, as indicated by the elbow. Note that if we have trouble choosing the elbow point of the curve, we can use a Python package, kneed, to identify the elbow point programmatically.

In [107]:
# modules
from kneed import KneeLocator

# Apply program to choose k
kl = KneeLocator(
    range(1, 7), wss, curve="convex", direction="decreasing"
)
print(kl.elbow)
2

We can also apply the silhouette method to choose the number of clusters:

In [151]:
# modules
from sklearn.metrics import silhouette_score

# A list holds the silhouette coefficients for each k
silhouette_coefficients = []

# Notice you start at 2 clusters for silhouette coefficient
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(reduced_data)
    score = silhouette_score(reduced_data, kmeans.labels_)
    silhouette_coefficients.append(score)
In [103]:
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
In [129]:
# apply kmeans algorithm with 2 clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(x)

# Show first 5 cluster assignments
identified_clusters = kmeans.fit_predict(x)
identified_clusters[0:5]
Out[129]:
array([1, 1, 1, 0, 1], dtype=int32)
In [130]:
# plot Clusters
plt.scatter(reduced_data["Dimension 1"], reduced_data["Dimension 2"], c=kmeans.labels_)
plt.show()

Gaussian Mixture Models

If we are uncertain which group a data point belongs to, instead of hard-assigning it to a cluster we can use this method: it uses the probability that a sample belongs to each cluster to make soft assignments.

Advantages of Gaussian Mixture Model clustering:

  • Due to the nature of soft assignments, a point can belong to two clusters with varying degrees of probability.
  • Does not bias the cluster sizes towards specific structures that may or may not exist in the data.

Disadvantages:

  • It may converge to a local optimum depending on the initialisation of the clusters.
  • It is a much more complicated model to interpret.

Given the soft assignment of points, Gaussian Mixture Model clustering may be better suited to our wholesale customer data than the other two algorithms (i.e., there may be a mixed-membership problem in our dataset where there is no clear demarcation).

A problem with hard assignment is that there might be some hidden patterns in the data that we may miss by assigning only one cluster to each data point. For example, let’s take the case of the Supermarket customer in our sample: while doing PCA, it had similar and high positive weights for multiple dimensions, i.e. it didn’t belong to one dimension over the other. So a supermarket may be a combination of a fresh produce store/grocery store/frozen goods store.

As with the K-Means algorithm, the number of clusters is not known a priori, and there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure, if any, exists in the data. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient.

Again, (as we previously mentioned under K-Means) the silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.
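A minimal sketch of this scoring, on toy two-blob stand-in data rather than our reduced data, using scikit-learn's `silhouette_samples` (per point) and `silhouette_score` (the mean):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# illustrative stand-in for the reduced data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

per_point = silhouette_samples(X, labels)  # one coefficient per customer
print("mean silhouette:", silhouette_score(X, labels))  # = per_point.mean()
```

Points near a cluster boundary get coefficients near 0, so inspecting `per_point` can also flag ambiguous customers, not just score the whole clustering.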

In [141]:
# import modules
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score


# Create range of clusters
n_clusters = list(range(2,11))
print("Different Number of Clusters (K) Tried: ", n_clusters)

# for each clusters k, generate silhouette score
for n in n_clusters:

    # Apply your GMM algorithm to reduced data
    clusterer = GaussianMixture(n_components=n).fit(reduced_data)

    # Predict the cluster for each data point
    preds = clusterer.predict(reduced_data)

    # Find the cluster centers
    centers = clusterer.means_

    # Predict the cluster for each transformed sample data point
    sample_preds = clusterer.predict(pca_samples)

    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(reduced_data,preds, metric="mahalanobis")

    print("The silhouette_score for {} clusters is {}".format(n,score));
Different Number of Clusters (K) Tried:  [2, 3, 4, 5, 6, 7, 8, 9, 10]
The silhouette_score for 2 clusters is 0.37598604546578696
The silhouette_score for 3 clusters is 0.33185526554634676
The silhouette_score for 4 clusters is 0.29649101365659736
The silhouette_score for 5 clusters is 0.32203042421648487
The silhouette_score for 6 clusters is 0.3089919910836205
The silhouette_score for 7 clusters is 0.3156021558286314
The silhouette_score for 8 clusters is 0.29969064151137104
The silhouette_score for 9 clusters is 0.24632499832191426
The silhouette_score for 10 clusters is 0.29759896068585295

Note, that the Silhouette Coefficient is calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each sample. Therefore, it makes sense to use the same distance metric here as the one used in the clustering algorithm. This is Euclidean for KMeans and Mahalanobis for general GMM.

Of the several cluster numbers tried, 2 clusters had the best silhouette score. Similarly, when we use KMeans, the best score is also obtained with the same number of clusters. Let's visualise this clustering.

In [144]:
# Apply your GMM algorithm to reduced data
clusterer = GaussianMixture(n_components=2).fit(reduced_data)

# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)

# Find the cluster centers
centers = clusterer.means_

# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)

# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)

Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster’s center point corresponds to the average customer of that segment.

Note that since we earlier reduced the original data in dimension and scaled it by a logarithm, we can recover the representative customer spending from these centre points by applying the inverse transformations. We attempt this below.

In [145]:
# Inverse transform the centers
log_centers = pca.inverse_transform(centers)

# Exponentiate the centers
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Segment 0 8953.0 2114.0 2765.0 2075.0 353.0 732.0
Segment 1 3552.0 7837.0 12219.0 870.0 4696.0 962.0

Considering the summary statistics for the product categories and the total purchase cost of each category for the representative data points above, we can try to deduce what type of establishment each customer segment represents.

Taking an educated guess:

  • Segment 0: best represents restaurants (Hotel/Restaurant/Cafe customers). Their spend on Fresh and Frozen is higher than the median, while their spend on Milk, Grocery and Detergents_Paper is lower than the median.

  • Segment 1: best represents supermarkets/retailers. They spend a higher-than-median amount on Milk, Grocery and Detergents_Paper, which are essentials to be stocked in such places, and close to the median on Deli.

Let's find which cluster each sample point is predicted to belong to. The sample points were the sample customers we drew from the original data (indices 26, 176, 392).

In [146]:
# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point", i, "predicted to be in Cluster", pred)
Sample point 0 predicted to be in Cluster 0
Sample point 1 predicted to be in Cluster 0
Sample point 2 predicted to be in Cluster 0
In [147]:
# Purchases from samples
print(samples)
Out[147]:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 9898 961 2861 3151 242 833
1 45640 6958 6536 7368 1532 230
2 518 4180 3600 659 122 654

It is evident that sample point 0 (customer) belongs to cluster 0 (segment 0), where spending on "Fresh" and "Frozen" is high.

When we drew up guesses as to what establishments the sample points might represent, we decided on restaurant, supermarket and café for customers (sample points) 0, 1 and 2 respectively.

Our judgement could be correct for sample points 0 and 2, while it is incorrect, or rather inconsistent, for sample point 1. Looking at the cluster visualisation in the previous section, it could be that sample 1 is the point close to the boundary of both clusters.

Note that we decided to drop the Channel and Region features early on in this project. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when applying the same PCA dimensionality reduction used earlier. Each data point is labelled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' in the reduced space. In addition, the sample points are circled in the plot to identify their labelling.

In [150]:
# Display the clustering results based on 'Channel' data
dup_outliers = [128, 65, 66, 75, 154]
vs.channel_results(reduced_data, dup_outliers, pca_samples)

The underlying distribution of Hotel/Restaurant/Cafe customers versus Retailer customers is shown above. The number of clusters we chose is consistent with this underlying distribution of two major clusters, so the clustering algorithm did well. There are customer segments that would be classified purely as "Retailers" or as "Hotels/Restaurants/Cafes" on the extreme left and right accordingly. This underlying classification is consistent with our observation that cluster 0 customers are typically restaurants and cafes and cluster 1 customers are typically markets.

Discussion

We can now discuss ways in which we could make use of the clustered data. That is; we can consider how the different groups of customers, the customer segments, may be affected differently by a specific delivery scheme.

Given that each customer has a customer segment it best identifies with (depending on the clustering algorithm applied), we can treat the customer segment as an engineered feature of the data. Such feature engineering has many benefits:

  • Higher efficiency of the model.
  • Simpler algorithms can fit the data.
  • It is easier for algorithms to detect patterns in the data.
  • Greater flexibility of the features.

More simply, assume the wholesale distributor recently acquired new customers and each provided estimates of anticipated annual spending on each product category. Knowing these estimates, the wholesale distributor can classify each new customer into a customer segment to determine the most appropriate delivery service. This can be done in supervised-learning fashion: to label the new customers, the distributor first builds and trains a supervised learner on the data we labelled through clustering. The features to fit are the estimated spends, and the target variable is the customer segment, i.e. 0 or 1 (restaurant or retailer). The classifier can then predict segments for new incoming data.
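A minimal sketch of this idea, on hypothetical stand-in data rather than the actual wholesale dataset (the spending values and the choice of a random forest are both illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# illustrative stand-in: six spending estimates per customer, labelled 0 or 1
# by the clustering (hypothetical data, not the actual wholesale dataset)
rng = np.random.default_rng(0)
spends = np.vstack([rng.normal(2, 0.5, (50, 6)), rng.normal(8, 0.5, (50, 6))])
segments = np.array([0] * 50 + [1] * 50)  # labels produced by the clustering

# train a supervised learner on the cluster-labelled data
clf = RandomForestClassifier(random_state=0).fit(spends, segments)

# classify a new customer's estimated annual spends into a segment
new_customer = rng.normal(8, 0.5, (1, 6))
print(clf.predict(new_customer))  # the new customer falls in segment 1
```

Any off-the-shelf classifier would do here; the key point is that the cluster assignments serve as training labels for routing future customers to a delivery service.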